Quantum CISC Compilation by Optimal Control and 
Scalable Assembly of Complex Instruction Sets beyond Two-Qubit Gates 
We present a quantum CISC compiler and show how to assemble complex instruction sets in a 
scalable way. Enlarging the toolbox of universal gates by optimised complex multi-qubit instruction 
sets thus paves the way to fight relaxation for realistic experimental settings. 

Compiling a quantum module into the machine code for steering a concrete quantum hardware 
device lends itself to be tackled by means of optimal quantum control. To this end, there are two 
opposite approaches: (i) one may use a decomposition into the restricted instruction set (RISC) 
of universal one- and two-qubit gates and translate them into the machine code, or (ii) one may 
prefer to generate the entire target module as a complex instruction set (CISC) directly by evolution 
under drift and available controls. Here we advocate direct compilation up to the limit of system 
size a classical high-performance parallel computer cluster can reasonably handle. For going beyond 
these limits, i.e. for large systems, we propose a combined way, namely (iii) to make recursive use of 
medium-sized building blocks generated by optimal control in the sense of a quantum CISC compiler. 

The advantage of the method over standard RISC compilations into one- and two-qubit universal 
gates is explored on the parallel cluster HLRB-II (with a total LINPACK performance of 63.3 TFlops/s) 
for the quantum Fourier transform, the indirect SWAP gate as well as for multiply-controlled NOT 
gates. Implications for upper limits to time complexities are also derived. 

PACS numbers: 03.67.-a, 03.67.Lx, 03.65.Yz, 03.67.Pp; 82.56.-b, 82.56.Jn, 82.56.Dj, 82.56.Fk 



Introduction 

Richard Feynman's seminal conjecture of using experimentally controllable quantum systems
to perform computational tasks [1] is rooted in reducing the complexity of the problem when
moving from a classical setting to a quantum setting. The most prominent pioneering example
is Shor's quantum algorithm of prime factorisation [3], which is of polynomial complexity
(BQP) on quantum devices instead of showing non-polynomial complexity on classical ones [5].
It is an example of a class of quantum algorithms [7] that solve hidden subgroup problems in
an efficient way, where in the Abelian case the speed-up hinges on the quantum Fourier
transform (QFT). Whereas the network complexity of the fast Fourier transform (FFT) for $n$
classical bits is of order $O(n\,2^n)$, the QFT for $n$ qubits shows a complexity of order
$O(n^2)$. Moreover, Feynman's second observation that quantum systems may be used to
efficiently predict the behaviour of other quantum systems has inaugurated a branch of
research dedicated to Hamiltonian simulation.

For implementing a quantum algorithm in an experimental setup, local operations and
universal two-qubit quantum gates are required as a minimal set ensuring every unitary
module can be realised. More recently, it turned out that generic qubit and qudit pair
interaction Hamiltonians suffice to complement local actions to universal controls. Common
sets of quantum computational instructions comprise (i) local operations such as the
Hadamard gate and the phase gate, (ii) the entangling operations CNOT, controlled-phase
gates, $\sqrt{\rm SWAP}$ and $i\,$SWAP, as well as (iii) the SWAP operation. The number of
elementary gates required for implementing a quantum module then gives the network or gate
complexity.

As is well known, a generic $n$-qubit operation requires exponentially many two-qubit gates
to be implemented exactly, the complexity being $O(n\,4^n)$. Yet, as has been pointed out by
Barenco et al., many quantum computationally pertinent gates can be decomposed into a number
of one- and two-qubit gates increasing linearly with the number of qubits. At the expense of
a single ancilla qubit this also holds for multiply controlled unitary gates [20] tantamount
to error correction. For an overview, see e.g. [23]. Moreover, Blais showed how to implement
the QFT with linear gate complexity. Later, Solovay and then Kitaev addressed the problem of
approximating arbitrary unitary gates by polynomially long 2-qubit gate sequences up to a
given precision [23]. More recently the bounds of approximating an arbitrary unitary were
taken down to a polynomial of sixth order in the number of qubits and of third order in the
geodesic distance of the unitary to unity [28]. Differential geometric aspects in terms of
Finsler metrics have been raised as well.

However, gate complexity often translates into too coarse an estimate for the actual time
required to implement a quantum module (see e.g. [30], [31]), in particular if the time
scales of a specific experimental setting have to be matched. Instead, effort has been taken
to give upper bounds on the actual time complexity [33], e.g., by way of numerical optimal
control.

Figure 1: Compilation in classical computation (left) and quantum computation (right). Quantum machine code has to be 
time-optimal or protected against relaxation, otherwise the coherent superpositions are wiped out. A quantum RISC-compiler 
(1) by universal gates leads to unnecessarily long machine code. Direct CISC-compilation into a single pulse sequence (2) 
exploits quantum control for a near time-optimal quantum machine code. Its classical complexity is NP, so direct compilation 
by numerical optimal control resorting to a classical computer is unfeasible for large quantum systems. The third way (3) 
promoted here pushes quantum CISC-compilation to the limits of classical supercomputer clusters and then assembles the 
multi-qubit complex instruction sets recursively into time-optimised or relaxation-protected quantum machine code. 



Interestingly, in terms of quantum control theory, the existence of universal gates is
equivalent to the statement that the quantum system is fully controllable, as has first been
pointed out in the quantum-control literature. This is, e.g., the case in systems of $n$
spin-1/2 qubits that form Ising-type weak-coupling topologies described by arbitrary
connected graphs [36], [37]. Therefore the usual approach to quantum compilation in terms of
local plus universal two-qubit operations [41] lends itself to be complemented by
optimal-control based direct compilation into machine code: it may be seen as a
technology-dependent optimiser in the sense of Ref. [41], however, tailored to deal with
more complex instruction sets than the usual local plus two-qubit building blocks. Not only
is it adapted to the specific experimental setting, it also allows for fighting relaxation
by either being near time-optimal or by exploiting relaxation-protected subspaces [43].
Devising quantum compilation methods for optimised realisations of given quantum algorithms
by admissible controls is therefore an issue of considerable practical interest. Here the
goal is to show how quantum compilation can favourably be accomplished by optimal control:
the building blocks for gate synthesis will be extended from the usual set of restricted
local plus universal two-qubit gates to a larger toolbox of scalable multi-qubit gates
tailored to yield high fidelity in short time given concrete experimental settings.



Quantum Compilation as an Optimal Control Task 

As shown in Fig. 1, the quantum compilation task can be addressed following different
principal guidelines: (1) by the standard decomposition into local operations and universal
two-qubit gates, which by analogy to classical computation was termed reduced instruction
set quantum computation (RISC), or (2) by using direct compilation into one single complex
instruction set (CISC). The existence of such a single effective gate is guaranteed simply
by the unitaries forming a group: a sequence of local plus universal gates is a product of
unitaries and thus a single unitary itself.

As a consequence, CISC quantum compilation lends itself to numerical optimal control (on
clusters of classical computers) for translating the unitary target module directly into the
'machine code' of evolutions of the quantum system under combinations of the drift
Hamiltonian $H_d$ and experimentally available controls $H_j$.

In a number of studies on quantum systems up to 10 qubits, we have shown that direct
compilation by gradient-assisted optimal control [46] allows for substantial speed-ups,
e.g., by a factor of 5 for a CNOT and a factor of 13 for a Toffoli gate on coupled Josephson
qubits. However, the direct approach naturally faces the limits of computing quantum systems
on classical devices: upon parallelising our C++ code for high-performance clusters, we
found that extending the quantum system by one qubit increases the CPU time required for
direct compilation into the quantum machine code of controls by roughly a factor of eight.
So the classical complexity for optimal-control based quantum compilation is NP.

Therefore, here we advocate a third approach (3) that 
uses direct compilation into multi-qubit complex instruc- 
tion sets up to the CPU-time limits of optimal quantum 
control on classical computers: these building blocks are 
designed such as to allow for recursive scalable quantum 
compilation in large quantum systems (i.e. those beyond 
classical computability) . In particular, the complex in- 
struction sets may be optimised such as to fight relax- 
ation by being near time-optimal, or, moreover, they may 
be devised such as to fight the specific imperfections of 
an experimental setting. 



Controllability 

Before turning to optimal-control based CISC quantum compilation in more detail, it is
important to ensure the quantum control system characterised by $\{H_d\} \cup \{H_j\}$ is in
fact fully controllable.

Hamiltonian quantum dynamics following Schrödinger's equation for the unitary image of a
complete basis set of 'state vectors' representing a quantum gate

$$ |\dot\psi(t)\rangle = -i\Big(H_d + \sum_{j=1}^{m} u_j(t)\,H_j\Big)\,|\psi(t)\rangle \tag{1} $$

$$ \dot U(t) = -i\Big(H_d + \sum_{j=1}^{m} u_j(t)\,H_j\Big)\,U(t) \tag{2} $$

resembles the setting of a standard bilinear control system with state $X(t)$, drift $A$,
controls $B_j$, and control amplitudes $u_j \in \mathbb{R}$ reading

$$ \dot X(t) = \Big(A + \sum_{j=1}^{m} u_j(t)\,B_j\Big)\,X(t) , \tag{3} $$

where $X(t) \in GL_N(\mathbb{C})$ and $A, B_j \in \mathrm{Mat}_N(\mathbb{C})$. Clearly in the
dynamics of closed quantum systems, the system Hamiltonian $H_d$ is the drift term, whereas
the $H_j$ are the control Hamiltonians with $u_j(t)$ as control amplitudes. In systems of
$n$ qubits, $|\psi\rangle \in \mathbb{C}^{2^n}$, $U \in SU(2^n)$, and
$iH_\nu \in \mathfrak{su}(2^n)$.
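
As a minimal sketch of how Eqn. (2) can be integrated numerically when the control amplitudes are taken to be piecewise constant (the setting commonly used in gradient-based pulse optimisation), the following Python snippet multiplies the propagators of the individual time slices. The function names `step_propagator` and `propagate`, the time step `dt` and the single-qubit test case are illustrative assumptions and not taken from the authors' code.

```python
import numpy as np

def step_propagator(h, dt):
    """exp(-i*dt*h) for a Hermitian matrix h via its eigendecomposition."""
    lam, v = np.linalg.eigh(h)
    return (v * np.exp(-1j * dt * lam)) @ v.conj().T

def propagate(h_drift, h_controls, amplitudes, dt):
    """Integrate dU/dt = -i (H_d + sum_j u_j(t) H_j) U for piecewise-constant u_j.

    amplitudes[k][j] is the amplitude of control j during time slice k.
    """
    u = np.eye(h_drift.shape[0], dtype=complex)
    for slice_amps in amplitudes:
        h = h_drift + sum(a * hc for a, hc in zip(slice_amps, h_controls))
        u = step_propagator(h, dt) @ u      # later slices act from the left
    return u

# Illustrative check (assumed example): one qubit, no drift, control sigma_x / 2.
sigma_x = np.array([[0, 1], [1, 0]], dtype=complex)
u_pi = propagate(np.zeros((2, 2), dtype=complex), [sigma_x / 2],
                 amplitudes=[[np.pi]] * 10, dt=0.1)
```

With a total rotation angle of pi about the x axis, `u_pi` agrees with $-i\sigma_x$ up to rounding, which is a quick sanity check of the sign and ordering conventions.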

A system is fully operator controllable if to every initial state $\rho_0$ the entire
unitary orbit $\mathcal{O}_U(\rho_0) := \{U \rho_0 U^\dagger \mid U \in SU(N)\}$ can be
reached. With density operators being Hermitian this means any final state $\rho(t)$ can be
reached from any initial state $\rho_0$ as long as both of them share the same spectrum of
eigenvalues.

As has been established, the bilinear system of Eqn. (2) is fully controllable if and only
if the drift and controls are a generating set of $\mathfrak{su}(N)$ by way of the
commutator, i.e., $\langle H_d, H_j \mid j = 1, 2, \dots, m \rangle_{\rm Lie} = \mathfrak{su}(N)$.

Example 1 Consider a system of $n$ weakly coupled spin-1/2 qubits. Let
$\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$,
$\sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}$,
$\sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$ be the Pauli matrices. In $n$
spins-1/2, a $\sigma_{kx}$ for spin $k$ is tacitly embedded as
$\mathbb{1} \otimes \cdots \otimes \mathbb{1} \otimes \sigma_x \otimes \mathbb{1} \otimes \cdots \otimes \mathbb{1}$,
where $\sigma_x$ is at position $k$. The same holds for $\sigma_{ky}$, $\sigma_{kz}$, and in
the weak coupling terms $\sigma_{kz}\sigma_{\ell z}$ with $1 \le k < \ell \le n$.

Now a system of $n$ qubits is fully controllable if, e.g., the control Hamiltonians $H_j$
comprise the Pauli matrices $\{\sigma_{kx}, \sigma_{ky} \mid k = 1, 2, \dots, n\}$ on every
single qubit selectively and the drift Hamiltonian $H_d$ encompasses the Ising pair
interactions $\{J_{k\ell}\,(\sigma_{kz}\sigma_{\ell z})/2 \mid k < \ell = 2, \dots, n\}$,
where the coupling topology of $J_{k\ell} \neq 0$ may take the form of any connected graph.
This theorem has meanwhile been generalised to other coupling types.
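
The Lie-algebra rank condition can be checked numerically for small systems. The sketch below, written under the assumption of two Ising-coupled qubits with $J_{12} = 1$, repeatedly forms commutators of the generators $iH$ until the real-linear span stops growing and then compares its dimension with $\dim \mathfrak{su}(4) = 15$. The helper names `embed` and `lie_closure_dim` and the tolerances are hypothetical choices for illustration only.

```python
import itertools
import numpy as np

I2 = np.eye(2, dtype=complex)
SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)

def embed(op, k, n):
    """Place the one-qubit operator op at position k (1-based) of n qubits."""
    out = np.array([[1.0 + 0j]])
    for pos in range(1, n + 1):
        out = np.kron(out, op if pos == k else I2)
    return out

def lie_closure_dim(hamiltonians, tol=1e-9):
    """Real dimension of the Lie algebra generated by the i*H via commutators."""
    basis = []                       # normalised, linearly independent matrices

    def as_real_vector(m):
        return np.concatenate([m.real.ravel(), m.imag.ravel()])

    def try_add(m):
        norm = np.linalg.norm(m)
        if norm < tol:
            return False
        m = m / norm
        stacked = np.array([as_real_vector(b) for b in basis + [m]])
        if np.linalg.matrix_rank(stacked, tol=1e-8) > len(basis):
            basis.append(m)
            return True
        return False

    for h in hamiltonians:
        try_add(1j * h)
    grew = True
    while grew:                      # iterate until no commutator adds a direction
        grew = False
        for a, b in itertools.combinations(list(basis), 2):
            if try_add(a @ b - b @ a):
                grew = True
    return len(basis)

n = 2
controls = [embed(s, k, n) for k in (1, 2) for s in (SX, SY)]
drift = embed(SZ, 1, n) @ embed(SZ, 2, n) / 2       # Ising coupling, J_12 = 1 assumed
dim_coupled = lie_closure_dim(controls + [drift])    # equals 15 if fully controllable
dim_uncoupled = lie_closure_dim(controls)            # local controls only
```

Without the coupling term only the six local generators are reached, so the comparison of `dim_coupled` with `dim_uncoupled` illustrates why a connected coupling graph is needed. The brute-force closure scales poorly and is meant for few-qubit checks only.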

In view of the compilation task in quantum computation 
we get the following synopsis: 

Corollary 1 The following are equivalent: 

(1) in a quantum system of $n$ coupled spins-1/2, the drift $H_d$ and the controls $H_j$
form a generating set of $\mathfrak{su}(2^n)$;

(2) the quantum system is operator controllable;

(3) every unitary transformation $U \in SU(2^n)$ can be realised by that system;

(4) there is a set of universal quantum gates for the quantum system.

Proof: The equivalence of (1) and (2) relies on the unitary group being a compact connected
Lie group: compact connected Lie groups have no closed subsemigroups that are not groups
themselves. Moreover, in compact connected Lie groups the exponential mapping is surjective,
hence (1) $\Leftrightarrow$ (3). Assertions (3) and (4) just re-express the same fact in
different terminology. $\blacksquare$



Scope and Organisation of the Paper 

The purpose of this paper is to show that optimal con- 
trol theory can be put to good use for devising multi- 
qubit building blocks designed for scalable quantum com- 
puting in realistic settings. Note these building blocks are 
no longer meant to be universal in the practical sense that 
any arbitrary quantum module should be built from them 
(plus local controls). Rather they provide specialised sets 
of complex instructions tailored for breaking down typi- 
cal tasks in quantum computation with substantial speed 
gains compared to the standard compilation by decom- 
position into one-qubit and two-qubit gates. Thus a CISC 
quantum compiler translates into significant progress in 
fighting relaxation. 

For demonstrating quantum CISC compilation and scal- 
able assembly, in this paper we choose systems with linear 
coupling topology, i.e., qubit chains coupled by nearest- 
neighbour Ising interactions. The paper is organised as 
follows: CISC quantum compilation by optimal control 
will be illustrated in three different, yet typical examples:

(1) the indirect 1,n-SWAP gate,

(2) the quantum Fourier transform (QFT),

(3) the generalisation of the CNOT and Toffoli gate to multiply-controlled NOT gates,
C$^n$NOT.

For every instance of n-qubit systems, we analyse the effects of (i) sacrificing
universality by going to special instruction sets tailored to the problem, (ii) extending
pair interaction gates to effective multi-qubit interaction gates, and (iii) we compare the
time gain by recursive m-qubit CISC-compilation ($m < n$) to the two limiting cases of the
standard RISC-approach ($m = 2$) on one hand and the (extrapolated) time-complexity inferred
from single-CISC compilation (with $m = n$) on the other.



Preliminaries 



Time Standards 



When comparing times to implement unitary target 
gates by the RISC vs the CISC approach, we will assume 
for simplicity that local unitary operations are 'infinitely' 
fast compared to the duration of the Ising coupling evo- 
lution so that the total gate time is solely determined 
by the coupling evolutions unless stated otherwise. Let 
us emphasise, however, this stipulation only concerns 
the time standards. The optimal-control assisted CISC- 
compilation methods presented here are in no way lim- 
ited to fast local controls. In particular, also the as- 
sembler step of concatenating the CISC building blocks is 
independent of the ratio of times for local operations vs 
coupling interactions. 



Overview on Gate and Time Complexities 

For practical purposes, the complexity of a unitary 
quantum operation can be expressed in terms of two mea- 
sures: the gate complexity counts the number of univer- 
sal one- and two-qubit gates for exactly implementing 
the target operation in a circuit. Moreover, in view of 
fighting relaxation, we will estimate the time complex- 
ity in terms of consecutive time-slots with simultaneous 
m-qubit modules required. 

In order not to raise false expectations, upon changing from universal 2-qubit
decompositions (RISC) to m-qubit CISC-implementations the gate complexity for exact
implementation of a generic n-qubit unitary operation clearly remains NP: it requires
'exponentially many' 2-qubit modules or m-qubit modules ($m > 2$) alike, yet a cut from the
order of roughly $4^n/4^2$ necessary 2-qubit modules down to some $4^n/4^m$ m-qubit modules
(with up to $m = 10$) is substantial and particularly valuable in few-qubit systems. More
elaborate estimates will be given shortly. Likewise, also in target modules with linear
2-qubit RISC complexity, m-qubit CISC complexity remains linear, yet when translated into
time complexity it may entail sizeable speed-ups: we will show examples where they allow for
accelerations by more than a factor of 13.

To be more precise, a lower bound for the number of two-qubit gates necessary to exactly
implement a generic n-qubit unitary target module was given by Barenco et al. [20]. Their
parameter-counting argument is based on a gem, which deserves to be picked up for
generalising it to realisations by m-qubit modules as illustrated in Fig. 2. The key is that
only in the first time slot the number of parameters directly relates to the unitary group,
while from the second slot onwards the parameters have to be counted in terms of cosets of
the form $SU(2^m)/\big(SU(2^{m-\mu}) \otimes SU(2^{\mu})\big)$, if the m-qubit module has
overlaps of $\mu$ qubits and $(m - \mu)$ qubits with the two adjacent modules in the time
slot before. The number of real parameters (denoted by # for short) in the respective basic
building blocks amounts to

$$ \#\, SU(2^m) = 4^m - 1 \tag{4} $$

$$ \#\, \frac{SU(2^m)}{SU(2^{m-\mu})} = 4^m - 1 - (4^{m-\mu} - 1) = 4^{m-\mu}\,(4^{\mu} - 1) \tag{5} $$

$$ \#\, \frac{SU(2^m)}{SU(2^{m-\mu}) \otimes SU(2^{\mu})} = 4^m - 1 - (4^{m-\mu} - 1) - (4^{\mu} - 1) = (4^{m-\mu} - 1)(4^{\mu} - 1) \tag{6} $$



With these stipulations one may readily determine the number of m-qubit gates in a unitary
network of the type of Fig. 2a, where $n/m$ is integer, such as to ensure to exhaust the
number $4^n - 1$ of parameters of a generic n-qubit target gate to be implemented. In the
first time slot there are $n/m$ parallel m-qubit gates (counting by the number of parameters
in the group according to Eqn. (4)), in the second time slot there are $(n/m - 1)$ parallel
m-qubit gates. They contribute the number of parameters of the coset (Eqn. (6)), where one
is forced to choose $\mu = m/2$ for even $m$ and $\mu = (m \pm 1)/2$ for odd $m$ in order to
be efficient. Following the same Margolus pattern one adds as many m-qubit gates (counting
cosets) as required to supersede $4^n - 1$ parameters. Using Gauss' brackets one thus
obtains the number $g_m$ of m-qubit gates needed to implement a generic n-qubit target gate

$$ g_m = \left\lceil \frac{4^n - 1 - \frac{n}{m}\,(4^m - 1)}{4^m - 4^{\mu} - 4^{m-\mu} + 1} \right\rceil + \frac{n}{m} \tag{7} $$



and the respective number of time slots $t_m = \lceil g_m / (n/m) \rceil$ by

$$ t_m = \left\lceil \frac{4^n - 1 - \frac{n}{m}\,(4^m - 1)}{\frac{n}{m}\,(4^m - 4^{\mu} - 4^{m-\mu} + 1)} \right\rceil + 1 \tag{8} $$

Figure 2: Decomposition of an n-qubit gate into a circuit 
of m-qubit gates, where m is a uniform block size and may 
consist of RISC modules m = 2 or CISC modules with m > 2. 
(a) Margolus pattern with $n/m$ integer, (b) $n - m\lfloor n/m \rfloor = \mu$ or 
(c) $n - m\lfloor n/m \rfloor = 2\mu$, with integer $\mu > 0$. 

Table I: Lower Bounds to Gate Complexities and Time Complexities for Implementing Generic n-Qubit Unitaries in $SU(2^n)$

| n-qubit operation with n = | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 20 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| no. of 2-qubit gates ($g_2$) | 1 | 6 | 27 | 112 | 453 | 1,818 | 7,279 | 29,124 | 116,505 | $1.22 \times 10^{11}$ | $1.79 \times 10^{59}$ |
| no. of 2-qubit time slots ($t_2$) | 1 | 6 | 14 | 56 | 151 | 606 | 1,820 | 7,281 | 23,301 | $1.22 \times 10^{10}$ | $3.57 \times 10^{57}$ |
| no. of 10-qubit gates ($g_{10}$) | | | | | | | | | 1 | 1,050,627 | $1.54 \times 10^{54}$ |
| no. of 10-qubit time slots ($t_{10}$) | | | | | | | | | 1 | 525,314 | $1.54 \times 10^{53}$ |




For even $n$ with $m = 2$ and $\mu = 1$, Eqn. (7) specialises to reproduce the result of
Ref. [20], i.e. $g_2 = \frac{1}{9}(4^n - 3n - 1)$.
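
The counting argument for the Margolus pattern with $n/m$ integer is easy to evaluate with exact integer arithmetic. The following sketch assumes even $m$ with $\mu = m/2$ and computes the number of gates as the first-slot gates plus the number of coset gates needed to exhaust $4^n - 1$ parameters, and the number of time slots as the gate count divided by $n/m$, rounded up. The function names are hypothetical, and the patterns with overhead qubits (odd $n$ for $m = 2$) are deliberately left out.

```python
def coset_params(m, mu):
    """Parameters of SU(2^m) / (SU(2^(m-mu)) x SU(2^mu)), cf. Eqn. (6)."""
    return 4**m - 4**mu - 4**(m - mu) + 1

def ceil_div(a, b):
    """Ceiling of a/b for integers, exact for arbitrarily large values."""
    return -(-a // b)

def gate_count(n, m):
    """Lower bound on the number of m-qubit gates for n/m integer and even m."""
    if n % m or m % 2:
        raise ValueError("this sketch covers only m dividing n with even m")
    blocks = n // m                                   # gates in the first time slot
    remaining = 4**n - 1 - blocks * (4**m - 1)        # parameters still to be covered
    return ceil_div(remaining, coset_params(m, m // 2)) + blocks

def time_slots(n, m):
    """Number of time slots when n/m gates act in parallel per slot."""
    return ceil_div(gate_count(n, m), n // m)

for n in (2, 4, 6, 8, 10, 20):
    print(n, gate_count(n, 2), time_slots(n, 2))
```

For $m = 2$ the loop reproduces the even-$n$ entries of Table I, and `gate_count(20, 10)` and `time_slots(20, 10)` give the corresponding 10-qubit entries.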

Next, consider Fig. 2b and its Margolus pattern with one overhead of
$\mu = n - m\lfloor n/m \rfloor$ qubits to be taken into account by Eqn. (5). Then the same
arguments give

$$ g'_m = \left\lceil \frac{4^n - 4^{m-\mu} - \lfloor\frac{n}{m}\rfloor\,(4^m - 1)}{4^m - 4^{\mu} - 4^{m-\mu} + 1} \right\rceil + \left\lfloor\frac{n}{m}\right\rfloor \tag{9} $$

$$ t'_m = \left\lceil \frac{4^n - 4^{m-\mu} - \lfloor\frac{n}{m}\rfloor\,(4^m - 1)}{\lfloor\frac{n}{m}\rfloor\,(4^m - 4^{\mu} - 4^{m-\mu} + 1)} \right\rceil + 1 \tag{10} $$

Finally, for a pattern with two such overheads as in Fig. 2c, where
$n - m\lfloor n/m \rfloor = 2\mu$, one likewise finds

$$ g''_m = \left\lceil \frac{4^n + 1 - 2\cdot 4^{m-\mu} - \lfloor\frac{n}{m}\rfloor\,(4^m - 1)}{4^m - 4^{\mu} - 4^{m-\mu} + 1} \right\rceil + \left\lfloor\frac{n}{m}\right\rfloor \tag{11} $$

$$ t''_m = \left\lceil \frac{4^n + 1 - 2\cdot 4^{m-\mu} - \lfloor\frac{n}{m}\rfloor\,(4^m - 1)}{\lfloor\frac{n}{m}\rfloor\,(4^m - 4^{\mu} - 4^{m-\mu} + 1)} \right\rceil + 1 \tag{12} $$

With efficient implementations requiring $\mu$ to be closest to $m/2$ (vide supra), three
overheads do not occur.

Since $g_m > g'_m > g''_m$, one may use $g''_m$ with the most efficient setting of
$\mu = m/2$ for $m$ even or $\mu = (m - 1)/2$ for $m$ odd as a lower bound for the number of
unitary m-qubit modules necessary to exactly implement an arbitrary generic n-qubit target
unitary.

In the limit of large $n$, one thus obtains the bounds on gate complexities
$g_2 \simeq 4^n/9$ and $g_{10} \simeq 4^n/1{,}046{,}529$, so $g_{10}/g_2 \simeq 1/116{,}281$.
Likewise the limiting time complexities $t_2 \simeq 2\cdot 4^n/(n\cdot 9)$ and
$t_{10} \simeq 10\cdot 4^n/(n\cdot 1{,}046{,}529)$ give a speed-up potential of
$t_{10}/t_2 \simeq 1/23{,}256$ in units of the ratio of single-gate times
$\tau_{10}/\tau_2$ in the respective experimental setting. These limiting speed-up ratios
are nearly reached already for $n = 10$, as the numbers given in Tab. I show. In this sense,
accelerations may be taken as roughly constant over the entire range of interest.

Although in generic n-qubit unitaries the CISC speed-up may appear overwhelming, quantum
algorithms usually resort by construction to highly non-generic unitary building blocks,
many of which have linear complexities [20]. However, in these seemingly less rewarding yet
practically relevant cases CISC compilation will turn out to be highly advantageous, as
demonstrated in three worked examples in the current study. Since generic and thus highly
entangled states have recently turned out to be computationally of modest use, recasting the
above analysis in terms of 2-designs and t-designs and following concentrations of measure
will give a more realistic estimate, which is part of a different project.


Error Propagation and Relaxative Losses 

As the main figure of merit we refer to a quality function

$$ q := F_{\rm tr}\; e^{-\tau/T_R} \tag{13} $$

resulting from the fidelity $F_{\rm tr}$ and the relaxative decay with overall relaxation
rate constant $T_R^{-1}$ during a duration $\tau$, assuming independence of fidelity and
decay. Moreover, for $n$ qubits one defines as the trace fidelity of an experimental unitary
module $U_{\rm exp}$ with respect to the target gate $V_{\rm target}$ the quantity

$$ F_{\rm tr} := \frac{1}{N}\,{\rm Re}\,{\rm tr}\{V_{\rm target}^\dagger U_{\rm exp}\} = 1 - \frac{1}{2N}\,\|V_{\rm target} - U_{\rm exp}\|_2^2 , \tag{14} $$

where both $U, V \in U(N)$ with $N := 2^n$. It follows via the simple relation to the
Euclidean distance

$$ \|V - U\|_2^2 = \|U\|_2^2 + \|V\|_2^2 - 2\,{\rm Re}\,{\rm tr}\{V^\dagger U\} = 2N - 2N\,\frac{1}{N}\,{\rm Re}\,{\rm tr}\{V^\dagger U\} = 2N\,(1 - F_{\rm tr}) , $$

the latter two identities invoking unitarity of $U, V$. The reason for choosing the trace
fidelity is its convenient Fréchet differentiability in view of gradient-flow techniques,
see also Ref. [56].

Figure 3: (Colour online) Comparison of error-propagation models for random unitary gates with m = 2 qubits (a) and m = 8 
qubits (b) requiring representations with different scales. Single gate fidelity in the Monte Carlo simulations is $F_m = 0.99999$. 
Repetition of the same gate A (blue) is compared with repetitions of a sequence of four independent gates ABCD (black). 
Out of 10 Monte Carlo simulations (details see text), the median (solid lines) as well as the best and worst cases (dashed lines) 
are given. The red solid lines denote independent error propagation $F_{\rm tr} = (F_m)^r$. Large systems (m = 8) with several gates 
(ABCD) resemble independent error propagation almost perfectly, as in (b) the black and the red solid lines virtually coincide. 

Consider an m-qubit-interaction module (CISC) with quality $q_m = F_m\, e^{-\tau_m/T_m}$
that decomposes into $r$ universal two-qubit gates (RISC), out of which $r' \le r$ gates
have to be performed sequentially. Moreover, each 2-qubit gate shall be carried out with the
uniform quality $q_2 = F_2\, e^{-\tau_2/T_2}$. Henceforth we assume for simplicity equal
relaxation rate constants, so $T_2 = T_m$ are identified with $T_R$. Then, as a first useful
rule of thumb and assuming independent error propagation, it is advantageous to compile the
m-qubit module directly if $F_m > (F_2)^r$. Or more precisely taking relaxation into
account, if the module can be realised with a fidelity

$$ F_m > (F_2)^r\; e^{-(r'\cdot\tau_2 - \tau_m)/T_R} . \tag{15} $$
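
The rule of thumb compares the quality of one directly compiled module with that of $r$ two-qubit gates, of which $r'$ run sequentially: the direct module pays off once its fidelity exceeds $(F_2)^r$ times the relaxation factor for the time saved. A minimal sketch of this threshold follows; the function name and all numerical values are hypothetical and serve only to illustrate the arithmetic.

```python
import math

def cisc_fidelity_threshold(f2, r, r_seq, tau2, tau_m, t_relax):
    """Minimal module fidelity F_m for which direct compilation pays off.

    f2      fidelity of a single two-qubit gate
    r       total number of two-qubit gates in the decomposition
    r_seq   number of those gates that must run sequentially (r' <= r)
    tau2    duration of one two-qubit gate
    tau_m   duration of the directly compiled m-qubit module
    t_relax relaxation time constant T_R (same time units as tau2, tau_m)
    """
    return f2**r * math.exp(-(r_seq * tau2 - tau_m) / t_relax)

# Hypothetical numbers, purely for illustration (all times in units of 1/J):
threshold = cisc_fidelity_threshold(f2=0.9999, r=15, r_seq=15,
                                    tau2=0.5, tau_m=2.0, t_relax=250.0)
```

If the direct module is shorter than the sequential part of the decomposition, the exponential factor is below one, so the fidelity demanded of the direct module is lower than the bare product $(F_2)^r$.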

A more refined picture emerges from Monte-Carlo simulations of error propagation. To this
end, compare the above independent error estimates with two scenarios for a sequence of $r$
gates in total: (i) the r-fold repetition of a single unitary gate $A$ with individual
errors, meant to give $A^r$ with $r = 1, 2, 3, \dots$, and (ii) the repetition of a sequence
of four different gates $A, B, C, D$, again each with individual errors, to give
$(D \circ C \circ B \circ A)^{r/4}$ where $r = 4, 8, 12, \dots$. In the sequel, we refer to
case (i) as AAAA and to case (ii) as ABCD.

For gates and errors to be generic, we use random unitaries (distributed according to the
Haar measure following a recent modification of the QR-algorithm). To a given random unitary
m-qubit gate $A_0 \in U(2^m)$ (defining its Hamiltonian $H_{A_0}$ via
$A_0 = e^{-iH_{A_0}}$) we simulate a generic error as follows: from another independent
unitary $E_j$ take the matrix logarithm $H_{A_j}$ such that $e^{-iH_{A_j}} = E_j$. Then to a
given trace fidelity $F$, a corresponding unitary with a Monte-Carlo random error (the error
being introduced on the level of the Hamiltonian generators) can readily be obtained by
solving

(16) 



for $\delta > 0$. Along these lines one obtains the Monte-Carlo fidelities for repeating the
A-gate by

$$ F_{AAAA}(r) = 1 - \frac{1}{2N}\,\Big\| (A_0)^r - \prod_{j=1}^{r} A_j \Big\|_2^2 \tag{17} $$

and

$$ F_{ABCD}(r) = 1 - \frac{1}{2N}\,\Big\| (D_0 C_0 B_0 A_0)^{r/4} - \prod_{j=1}^{r/4} (D_j C_j B_j A_j) \Big\|_2^2 \tag{18} $$

where the product runs from right to left. These Monte-Carlo simulations are compared to the
simple model of independent errors according to

$$ F_{\rm ind} = (F_{\rm tr})^r . \tag{19} $$
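
The AAAA scenario can be sketched in a few lines of Python. The error model used here is an assumption made for illustration: each noisy copy is taken as $e^{-i(H_{A_0} + \delta H_{\rm err})}$, with $H_{\rm err}$ the generator of an independent Haar-random unitary and $\delta > 0$ tuned by bisection until the trace fidelity with respect to $A_0$ hits the prescribed value. All function names, the seed and the bisection settings are hypothetical and need not coincide with the procedure used for Fig. 3.

```python
import numpy as np

def haar_unitary(dim, rng):
    """Haar-random unitary from a phase-corrected QR decomposition."""
    z = rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def generator(u):
    """Hermitian H with exp(-iH) = u, built from the eigenphases of u."""
    eigvals, w = np.linalg.eig(u)
    h = (w * (-np.angle(eigvals))) @ np.linalg.inv(w)
    return (h + h.conj().T) / 2

def exp_minus_i(h):
    lam, v = np.linalg.eigh(h)
    return (v * np.exp(-1j * lam)) @ v.conj().T

def trace_fidelity(v, u):
    return np.real(np.trace(v.conj().T @ u)) / v.shape[0]

def noisy_copy(a0, h0, target_f, rng):
    """ASSUMED error model: perturb the generator h0 and tune delta by bisection."""
    h_err = generator(haar_unitary(a0.shape[0], rng))

    def fid(delta):
        return trace_fidelity(a0, exp_minus_i(h0 + delta * h_err))

    lo, hi = 0.0, 1e-6
    while fid(hi) > target_f and hi < 1e3:     # bracket the crossing point
        lo, hi = hi, 2 * hi
    for _ in range(60):                        # plain bisection
        mid = (lo + hi) / 2
        if fid(mid) > target_f:
            lo = mid
        else:
            hi = mid
    return exp_minus_i(h0 + hi * h_err)

def fidelity_aaaa(a0, r, target_f, rng):
    """Trace fidelity of r independently perturbed copies of a0 against a0^r."""
    h0 = generator(a0)
    product = np.eye(a0.shape[0], dtype=complex)
    for _ in range(r):
        product = noisy_copy(a0, h0, target_f, rng) @ product
    return trace_fidelity(np.linalg.matrix_power(a0, r), product)

rng = np.random.default_rng(seed=7)
a0 = haar_unitary(4, rng)                      # random two-qubit gate
f_single = fidelity_aaaa(a0, 1, 0.99999, rng)
f_ten = fidelity_aaaa(a0, 10, 0.99999, rng)
f_independent = 0.99999**10                    # independent-error model for comparison
```

Comparing `f_ten` with `f_independent` over many seeds gives a rough impression of how strongly repeated two-qubit gates can deviate from the independent-error estimate.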

As shown in Fig. 3a, for two-qubit gates the error propagates with a vast variance, which
makes it virtually unpredictable. Thus assuming independence is always too optimistic for
AAAA, while for ABCD it is still mostly optimistic, although there are cases in which the
errors may compensate to give less effective loss than expected under independence.

However, when moving to effective multi-qubit gates, i.e., CISC modules, the generic
situation becomes more predictable. For example, in 8-qubit random unitary gates, Fig. 3b
shows that AAAA deviates significantly from independent error propagation, whereas ABCD
resembles independent error propagation almost perfectly. The situation is qualitatively
exactly the same even if the single gate error is larger, as tested by analogous Monte-Carlo
simulations setting $F_{\rm tr} = 0.99$ or $F_{\rm tr} = 0.96$ (not shown).

Figure 4: (a) Simple starting point: building a SWAP$_{1,4}$ gate from five SWAPs$_{1,2}$. (b) Generalisation: assembling a SWAP$_{1,n}$ by 
four SWAPs$_{1,m_j}$ for each type $j = 1, 2, \dots, k-1$ and one single SWAP$_{1,m_k}$ so that $m_k + 2\sum_{j=1}^{k-1}(m_j - 1) = n$. 

In the sequel, we will — for the sake of simplicity — often 
assume independent error propagation at the expense of 
systematically underestimating the pros of CISC compi- 
lation compared to the standard RISC compilation into 
universal local and two-qubit gates. 



Computational Methods and Devices 

Following the lines of our previous work on time complexity, we used the GRAPE algorithm for
direct CISC compilation. It tracks the fixed final times down to the shortest durations of
controls still allowing for synthesising the unitary target gates with full fidelity. This
currently gives the best known upper bounds to the minimal times required to realise a
target module on a concrete hardware setting. We extended our parallelised C++ code of the
GRAPE package by adding more flexibility allowing to efficiently exploit available parallel
nodes independent of internal parameters. Moreover, faster algorithms for matrix
exponentials on high-dimensional systems based on approximations by Tchebychev series have
been developed [59] specifically in view of application to large quantum systems. Thus
computations could be performed on the HLRB-II supercomputer cluster at Leibniz
Rechenzentrum of the Bavarian Academy of Sciences Munich. It provides an SGI Altix 4700
platform equipped with 9728 Intel Itanium2 Montecito Dual Core processors with a clock rate
of 1.6 GHz, which give a total LINPACK performance of 63.3 TFlops/s. The present explorative
study exploited the time allowance of approx. 500,000 CPU hours.



I. THE 1,n SWAP OPERATION 

The easiest and most basic examples to illustrate the pertinent effects of optimal-control
based CISC-quantum compilation are the respective indirect SWAP$_{1,n}$ gates in spin chains
of $n$ qubits coupled by nearest-neighbour Ising interactions with $J_{zz}$ denoting the
coupling constant.

For the SWAP$_{1,2}$ unit there is a standard textbook decomposition into three CNOTs. Thus
for Ising-coupled systems and in the limit of fast local controls, the total time required
for a SWAP$_{1,2}$ is $3/(2J_{zz})$, and there is no faster implementation [60]. Note,
however, that in systems coupled by the isotropic Heisenberg interaction XXX, the
SWAP$_{1,2}$ may be directly implemented just by letting the system evolve for a time of
only $1/(2J_{xxx})$. Sacrificing universality, it may thus be advantageous to regard the
SWAP$_{1,2}$ as basic unit for the SWAP$_{1,n}$ task rather than the universal CNOT.
Clearly, any even-order SWAP$_{1,2n}$ can be built from SWAPs$_{1,2}$ along the lines of the
most obvious scheme of Fig. 4a. (The odd-order SWAPs$_{1,2n-1}$ follow, e.g., from
SWAP$_{1,2n}$ by omitting qubit $2n$ and all the gates connected to it.)

Moreover, the generalisation to decomposing a SWAP$_{1,n}$ into a sequence with $k$
different SWAP$_{1,m_j}$ building blocks (where $j = 1, 2, \dots, k$) as shown in Fig. 4b is
straightforward by ensuring $m_k + 2\sum_{j=1}^{k-1}(m_j - 1) = n$. Due to its symmetry, the
total duration then amounts to

$$ \tau({\rm SWAP}_{1,n}) = \tau({\rm SWAP}_{1,m_k}) + 2\sum_{j=1}^{k-1} \tau({\rm SWAP}_{1,m_j}) \tag{20} $$

and the overall quality as a function of the fidelities of the constituent gates reads

$$ q_{{\rm SWAP}_{1,n}} = F({\rm SWAP}_{1,m_k}) \prod_{j=1}^{k-1} F({\rm SWAP}_{1,m_j})^4 \times e^{-\tau({\rm SWAP}_{1,n})/T_R} . \tag{21} $$

Figure 5: (Colour online) (a): Times required for indirect SWAPs$_{1,n}$ on linear chains of n Ising-coupled qubits by assembling 
SWAP$_{1,m}$ building blocks reaching from m = 2 (RISC) up to m = 8 (CISC). Using linear regression, the dashed line is an 
extrapolation of the direct single-CISC compilations shown in the inset to large number of qubits, where direct CISC compilation 
is virtually impossible on classical computers. Time units are expressed as $1/J_{zz}$ assuming the duration of local operations 
can be neglected compared to coupling evolutions (details in the text). (b): Translation of the effective gate times into overall 
quality figures $q = (q_m)^{r_m}$ for an effective gate assembled from components of single qualities $q_m := F_m\, e^{-\tau_m/T_R}$ (with the 
respective component fidelities homogeneously falling into a narrow interval $F_m \in [0.99994, 0.99999]$ for m = 3, ..., 8). Data 
are shown for a uniform relaxation rate constant of $1/T_R = 0.004\,J_{zz}$. 



Now, the SWAP$_{1,m_j}$ building blocks themselves can be
precompiled into time-optimised single complex instruction
sets by exploiting the GRAPE algorithm of optimal
control up to the current limits of $m_j$ imposed by
CPU-time allowance.

Proceeding in the next step to large $n$, Fig. [5] underscores
how the time required for SWAP$_{1,n}$ gates decreases
significantly by assembling precompiled SWAP$_{1,m_j}$ building
blocks as CISC units recursively up to a multi-qubit
interaction size of $m_j = 8$, where the speed-up is by a factor
of nearly 2. Clearly, such a set of SWAP$_{1,m_j}$ building
blocks with $m_j \in \{2,3,4,5,6,7,8\}$ allows for efficiently
synthesising any SWAP$_{1,n}$. Assuming for the moment that
a linear time complexity of the SWAP$_{1,n}$ can be taken for
granted, one may extrapolate the results of direct CISC
compilation from the range of the inset of Fig. [5] a to a
large number of qubits. One thus obtains an estimated
upper limit to the time complexity of the SWAP$_{1,n}$. This
is indicated by the dashed line, the slope of which will be
defined as $\Delta_\infty$. Likewise, the respective slopes of the
$m$-qubit decomposition are denoted by $\Delta_m$.

With these stipulations, we introduce as a measure for
the potential of CISC compilation (versus RISC compilation)
the ratio of the slopes

$$\pi_{\rm cisc} := \frac{\Delta_2}{\Delta_\infty} \qquad (22)$$

and as a measure for the extent to which this potential
has been exhausted by $m$-qubit CISC compilation the ratio

$$\eta_m := \frac{\Delta_\infty}{\Delta_m} \qquad (23)$$

thus providing as convenient measure of improvement

$$\xi_m := \frac{\Delta_2}{\Delta_m} = \eta_m \cdot \pi_{\rm cisc}\,. \qquad (24)$$

The data of Fig. [5] thus give a potential of $\pi_{\rm cisc} = 2.16$;
by $m = 8$-qubit interactions it is already pretty
well exhausted, as inferred from $\eta_8 = 0.87$. The current
CISC over RISC improvement then amounts to $\xi_8 = 1.88$.
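As a small sanity check of the three slope ratios, the following sketch evaluates them for given regression slopes. The function name is hypothetical, and the slopes in the usage note are rescaled placeholders chosen only so that they reproduce the quoted ratios 2.16 and 0.87; they are not the measured slopes.

```python
def cisc_measures(delta_2, delta_m, delta_inf):
    """Return (pi_cisc, eta_m, xi_m) from the slopes of the RISC (m = 2),
    the m-qubit CISC and the extrapolated direct-CISC time curves."""
    pi_cisc = delta_2 / delta_inf   # potential of CISC compilation
    eta_m = delta_inf / delta_m     # fraction of the potential already exploited
    xi_m = delta_2 / delta_m        # actual improvement, equals eta_m * pi_cisc
    return pi_cisc, eta_m, xi_m
```

With slopes rescaled such that `delta_inf = 1.0`, the choice `delta_2 = 2.16` and `delta_m = 1 / 0.87` gives an improvement of about 1.88, in line with the value quoted in the text.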

On the other hand, deducing from Fig. [5] right away
that the time complexity of SWAP$_{1,n}$s ought to be linear
would be premature: although the slopes seem to
converge to a non-zero limit, numerical optimal control
may become systematically inefficient for larger interaction
sizes $m$. Therefore, although improbable, e.g.,
convergence of the slopes to a value of zero cannot be safely
excluded on the current basis of findings. This also means
a logarithmic time complexity can ultimately not be
excluded either.

Summarising the results for the indirect SWAPs in
terms of the three criteria described in the introduction,
we have the following: (i) in Ising coupled qubit chains,
there is no speed-up by changing the basic unit from
the universal CNOT into a SWAP$_{1,2}$, whereas in isotropically
coupled systems the speed-up amounts to a factor
of three; (ii) extending the building blocks of SWAP$_{1,m}$
from $m = 2$ (RISC) to $m = 8$ (CISC) gives a speed-up
by a factor of nearly two under Ising-type couplings; (iii)
the numerical data are consistent with a time complexity
converging to a linear limit for the SWAP$_{1,n}$ task in Ising
chains; however, there is no proof for it yet.
Figure 6: By rearranging the swaps and controlled phase gates, the standard decomposition of a 3m-qubit quantum Fourier
transform, QFT (top trace) reduces to a realisation adapted to a coupling topology of linear nearest-neighbour interactions 
(lower trace) with a 2m-qubit QFT, m-qubit cP-SWAPs (solid boxes), and an m-qubit QFT (dashed box). The notation $(km - u)$
is a shorthand for a rotation angle of $\varphi = \pi/2^{km-u}$.




Figure 7: For $k \geq 2$, a $(km)$-qubit QFT can be assembled
from $k$ times an $m$-qubit QFT and $\binom{k}{2}$ instances of $2m$-qubit
modules cP-SWAP$^{j}_{2m}$, where the index $j$ of different phase-rotation
angles takes the values $j = 1, 2, \dots, k-1$. The dashed
boxes correspond to Fig. [6] and show the induction $k \to k + 1$.



II. THE QUANTUM FOURIER TRANSFORM 
(QFT) 

Since many quantum algorithms take advantage of efficiently
solving some hidden subgroup problem, the quantum
Fourier transform plays a central role [1, 0, [1|.

In order to realise a QFT on large qubit systems, our
approach is the following: given an $m$-qubit QFT, we show
that for obtaining a $(k \cdot m)$-qubit QFT by recursively using
multi-qubit building blocks, a second type of module is
required, to wit a combination of controlled phase gates
and SWAPs, which henceforth we dub $m'$-qubit cP-SWAP
for short.

Here we present two alternatives: variant I with $m' = 2m$
and, as a special case, variant II for even $m' = m$.

Choosing $m = 2$ and $k = 3$ for a start, the recursive
construction is illustrated in Fig. [6]. The top trace shows
the standard textbook realisation of a 6-qubit QFT. By
shifting the final SWAP operations, it can be rearranged
into the sequence of gates depicted in the lower trace.
Note that the gates appearing in solid boxes constitute
a $2m$-qubit QFT (which itself is made of two $m$-qubit
QFTs and a central $m$-qubit cP-SWAP), while the ones
in dashed boxes have to be added for a $3m$-qubit QFT.
For $m = 2$ we have thus shown how a $3m$-qubit QFT reduces
to a $2m$-qubit QFT, two $2m$-qubit cP-SWAPs, and
an $m$-qubit QFT. So with $2m$ providing a foundation, at
the same time we have also illustrated the induction from
a $(k \cdot m)$-QFT to a $((k+1) \cdot m)$-QFT. Moreover, the same
construction principle holds for any block size $m = 2, 3, \dots$,
which can readily be proven by a straightforward, but
lengthy induction from $m$ to $m + 1$.

One thus arrives at the desired block decomposition of
a general $(k \cdot m)$-qubit QFT as shown in Fig. [7] (which
is variant I; the less effective variant II can be found in
Appendix A): it requires $k$ times the same $m$-qubit QFT
interspersed with $\binom{k}{2}$ times a $2m$-qubit cP-SWAP, out
of which $k - 1$ show different phase-rotation angles. For
all $m$ and $j = 1, 2, \dots, (k - 1)$, one finds the following
observations:
Figure 8: (Colour online) Comparison of CISC-compiled QFT (red) with standard RISC compilations ($m = 2$) following the
scheme by Saito [s^j (black) or Blais [2^ (blue): (a) times for implementation translate into quality factors (b) for a relaxation
rate constant of $1/T_R = 0.004\,J_{zz}$. Again, the dashed red line extrapolates from the direct single-CISC compilations shown in
the upper inset of (a), the lower inset giving in logarithmic scales the times needed for the standard textbook RISC compilation
($m = 2$) on a linear Ising chain. Dotted red lines represent the less favourable results from QFT variant II (Appendix A).



1. a cP-SWAP$^{j}_{2m}$ takes at least as long as a QFT$_m$;

2. a QFTm takes at least as long as a cp-SWAP:^j;

3. a cp-SWAP:^„ takes at least as long as a cp-SWAP^^ .

Thus the duration of a $(k \cdot m)$-qubit QFT built from
$m$-qubit and $2m$-qubit modules amounts to

$$\tau(\mathrm{QFT}_{k\cdot m}) = 2\,\tau(\mathrm{QFT}_m) + (k-1)\,\tau(\text{cP-SWAP}^{1}_{2m}) + (k-2)\,\tau(\text{cP-SWAP}^{2}_{2m})\,. \qquad (25)$$
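The duration formula can be evaluated with a short helper. This is a sketch under stated assumptions: the duration is read as two QFT$_m$ modules plus $(k-1)$ and $(k-2)$ layers governed by the cP-SWAP modules with the first and second phase-rotation index, respectively (the superscript labels are our reading of the garbled formula), and the function names and module times are hypothetical.

```python
from math import comb

def qft_duration(k, t_qft, t_cp1, t_cp2):
    """Duration of a (k*m)-qubit QFT assembled from m-qubit QFTs and
    2m-qubit cP-SWAPs: 2*t_qft + (k-1)*t_cp1 + (k-2)*t_cp2."""
    if k < 2:
        raise ValueError("the block decomposition needs k >= 2")
    return 2 * t_qft + (k - 1) * t_cp1 + (k - 2) * t_cp2

def cp_swap_count(k):
    """Number of 2m-qubit cP-SWAP instances in the block decomposition."""
    return comb(k, 2)
```

Note that the $(k-1) + (k-2) = 2k - 3$ cP-SWAP layers entering the duration are fewer than the $\binom{k}{2}$ module instances as soon as $k > 3$, since modules acting on disjoint blocks run in parallel.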

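Next, consider the overall quality of a $(k \cdot m)$-qubit QFT
in terms of its two types of building blocks, namely the
basic $m$-qubit QFT as well as the constituent $2m$-qubit
cP-SWAPs with their respective different rotation angles.
It reads

$$q_{\mathrm{QFT}_{k\cdot m}} = \big(F_{\mathrm{QFT}_m}\big)^{k} \Big(\prod_{j=1}^{k-1} \big(F_{\text{cP-SWAP}^{j}_{2m}}\big)^{k-j}\Big)\; e^{-\tau(\mathrm{QFT}_{k\cdot m})/T_R}\,. \qquad (26)$$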



In the following, we will neglect rotations as soon as their
angle falls short of a threshold of $\pi/2^{10}$. This approximation
is safe since it is based on a calculation of a 20-qubit
QFT, where the truncation does not introduce any relative
error beyond 10~^. According to the block decomposition
of Fig. [7], thus three instances of cP-SWAPs are
left, since all cP-SWAP$^{j}_{10}$ elements with $j \geq 3$ boil down
to mere SWAP gates due to truncation of small rotation
angles. The representation of these cP-SWAP modules is
shown in Appendix B as Fig. [17].

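With these stipulations, we address the task of assembling
a $(k \cdot 10)$-qubit QFT, exploiting the limits of current
allowances on the HLRB-II cluster. This translates into
using 10-qubit cP-SWAP building blocks ($2m = 10$) and
the 5-qubit QFT ($m = 5$) in the sense of a $(2k \cdot 5)$-qubit
QFT. Its duration $\tau(\mathrm{QFT}_{2k\cdot 5})$ is readily obtained as in
Eqn. [25], thus giving an overall quality of

$$q_{\mathrm{QFT}_{2k\cdot 5}} = \big(F_{\mathrm{QFT}_5}\big)^{2k} \big(F_{\text{cP-SWAP}^{1}_{10}}\big)^{2k-1} \big(F_{\text{cP-SWAP}^{2}_{10}}\big)^{2k-2} \big(F_{\text{cP-SWAP}^{3}_{10}}\big)^{\binom{2k}{2}-4k+3}\; e^{-\tau(\mathrm{QFT}_{2k\cdot 5})/T_R}\,. \qquad (27)$$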
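The counting behind the quality factors can be checked numerically. The sketch below assumes that a QFT of $K$ blocks contains $K$ basic QFT modules and $K - j$ copies of the cP-SWAP with phase-rotation index $j$ (which adds up to $\binom{K}{2}$ modules), and that all modules with $j \geq 3$ share one fidelity after truncation. Function names and fidelity values are hypothetical.

```python
from math import comb, exp

def qft_quality(n_blocks, f_qft, f_cp, tau, t_relax):
    """Product of component fidelities times the relaxation factor.
    f_cp(j) returns the fidelity of the cP-SWAP with phase index j."""
    q = f_qft ** n_blocks
    for j in range(1, n_blocks):
        q *= f_cp(j) ** (n_blocks - j)
    return q * exp(-tau / t_relax)

def truncated_cp_fidelity(f1, f2, f3):
    """After truncation only three distinct cP-SWAP modules remain (j = 1, 2, >= 3)."""
    return lambda j: f1 if j == 1 else f2 if j == 2 else f3

def swap_like_exponent(k):
    """Number of truncated (j >= 3) modules in a (2k)-block QFT."""
    return comb(2 * k, 2) - 4 * k + 3
```

For example, for `k = 3` (six blocks of five qubits) `swap_like_exponent(3)` gives 6, which equals the direct count $3 + 2 + 1$ of modules with $j = 3, 4, 5$.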



Based on this relation, the numerical results of Fig. [8]
show that a CISC-compiled QFT is moderately superior to
the standard RISC versions [2^ [g^I. Although the potential
of CISC compilation amounts to $\pi_{\rm cisc} = 2.27$, recursively
assembling 5-qubit QFTs and 10-qubit cP-SWAPs
only exploits about half of it as apparent in the value of
$\eta_{5,10} = 0.53$.

As has been pointed out by Zeier [63], the decomposition
of a many-qubit QFT into smaller QFTs and concatenations
of a permutation matrix and a diagonal matrix
goes back to a principle already used in the Cooley-Tukey
algorithm [9[ for the discrete Fourier transform
(DFT): Let $N = m \cdot q$. Then one obtains [6i.[65t

$$\begin{aligned}
\mathrm{DFT}_N &= L \circ (\mathrm{DFT}_m \otimes \mathbb{1}_q) \circ D \circ (\mathbb{1}_m \otimes \mathrm{DFT}_q) \\
&= (\mathbb{1}_q \otimes \mathrm{DFT}_m) \circ (L \circ D) \circ (\mathbb{1}_m \otimes \mathrm{DFT}_q)
\end{aligned} \qquad (28)$$

where $L$ is an $N \times N$ permutation matrix. Moreover,
setting $\omega := e^{2\pi i/N}$, the diagonal matrix takes the form

$$D = \mathrm{diag}\big(\omega^{t_k} \,\big|\, t_k = (k \bmod m)\,\lfloor k/m \rfloor \ \text{for}\ k = 0, 1, 2, \dots, N-1\big) \qquad (29)$$

Therefore, the QFT decompositions made use of here 
exactly follow the classical scheme in the second line of 
Figure 9: Decomposition of a C$^3$NOT with one ancilla
qubit into four C$^2$NOTs (Toffoli gates) according to ref.
States $|a\rangle$ through $|e\rangle$ are explained in the text.



Eqn. [28], the expression $(L \circ D)$ corresponding to the
cP-SWAP.
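The classical factorisation can be verified numerically for small sizes. The sketch below is an illustration under explicit assumptions: `np.kron` orders the tensor factors in the opposite way to how they are printed in Eqn. (28), so $\mathrm{DFT}_m \otimes \mathbb{1}_q$ is coded as `np.kron(np.eye(q), dft(m))`; the explicit form of the permutation $L$ is our own construction, since the text only states that $L$ is a permutation matrix; and the unitary normalisation $1/\sqrt{N}$ is assumed.

```python
import numpy as np

def dft(n):
    """Unitary DFT matrix with omega = exp(2*pi*i/n)."""
    k = np.arange(n)
    return np.exp(2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

def twiddle(m, q):
    """Diagonal matrix D with exponents t_k = (k mod m) * floor(k / m)."""
    n_tot = m * q
    k = np.arange(n_tot)
    return np.diag(np.exp(2j * np.pi * ((k % m) * (k // m)) / n_tot))

def permutation(m, q):
    """Permutation sending index b*m + a to index a*q + b (assumed form of L)."""
    n_tot = m * q
    perm = np.zeros((n_tot, n_tot))
    for a in range(m):
        for b in range(q):
            perm[a * q + b, b * m + a] = 1.0
    return perm

def cooley_tukey(m, q):
    """Both lines of the factorisation, written with np.kron ordering."""
    L, D = permutation(m, q), twiddle(m, q)
    first = L @ np.kron(np.eye(q), dft(m)) @ D @ np.kron(dft(q), np.eye(m))
    second = np.kron(dft(m), np.eye(q)) @ (L @ D) @ np.kron(dft(q), np.eye(m))
    return first, second
```

Both returned matrices should agree with `dft(m * q)` up to rounding, e.g. for `m = 2, q = 3`; the product `L @ D` then plays the role of the cP-SWAP module.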



Fig. [10] provides a generalisation of the scheme in
Fig. [9]: in the first place (a), the number of control qubits
is reduced by introducing $k$ blocks with $m_3$ qubits that
are left invariant. The price for this reduction is a fourfold
occurrence of the reduced building blocks. In the second
step (b), the reduced building blocks are expanded
into a sequence with two central C$^{m_2}$NOTs, two terminal
C$^{(m_3+1)}$NOTs and two lots of $2(k-1)$ times C$^{(m_3+1)}$NOT
each. For $k = 4, 5, 6, \dots$ parts (a) and (b) can be expanded
in a general concatenated way thus entailing an
overall duration of

$$\begin{aligned}
\tau(\mathrm{C}^{n-2}\mathrm{NOT})\big|_{k \geq 4} ={}& 4\,\tau(\mathrm{C}^{m_1}\mathrm{NOT}) \\
&+ 4\,\tau(\mathrm{C}^{m_2}\mathrm{NOT}) + \tau(\mathrm{SWAP}_{1,m_2}) \\
&+ (13k - 8)\,\tau(\mathrm{C}^{m_3+1}\mathrm{NOT}) \\
&+ (1 - \delta_{m_3,1})\,(13k + 3)\,\tau(\mathrm{SWAP}_{1,m_3})
\end{aligned} \qquad (30)$$



III. THE MULTIPLE-CONTROLLED NOT
GATE (C$^n$NOT)

Multiply-controlled CNOT gates generalise Toffoli's
gate. Here, we move from C$^2$NOT to C$^{n-2}$NOT in an
$n$-qubit system with one ancilla and one target qubit.
The reason for the ancilla qubit is that it turns the
problem to linear complexity [20]. Moreover, in view of
realistic large systems, we assume again a topology of
a linear chain coupled by nearest-neighbour Ising interactions.
Since C$^n$NOT gates frequently occur in
error-correction schemes, they are highly relevant in practice.

Here we address the task of decomposing a C$^{n-2}$NOT
into lower C$^m$NOTs and indirect SWAP gates (see Sec. I).

To this end, we will generalise the basic principle of reducing
a C$^n$NOT to C$^m$NOT gates with $m < n$ that can be
demonstrated by decomposing a C$^3$NOT into Toffoli gates
according to the scheme of Fig. [9] devised by Barenco et al.
in [20]. Starting with any of the $2^5$ computational basis
states $|x_1, x_2, x_3, x_4, x_5\rangle$ (where $x_k \in \{0,1\}$, $\oplus$ denotes
addition mod 2, and $x_k x_\ell$ being the usual scalar product)
track the effect of the gate sequentially from state
$|a\rangle$ through state $|e\rangle$

$$\begin{aligned}
|a\rangle &= |x_1, x_2, x_3, x_4, x_5\rangle \\
|b\rangle &= |x_1, x_2, x_3 \oplus x_1 x_2, x_4, x_5\rangle \\
|c\rangle &= |x_1, x_2, x_3 \oplus x_1 x_2, x_4, x_5 \oplus x_4 (x_3 \oplus x_1 x_2)\rangle \\
          &= |x_1, x_2, x_3 \oplus x_1 x_2, x_4, x_5 \oplus x_4 x_3 \oplus x_1 x_2 x_4\rangle \\
|d\rangle &= |x_1, x_2, x_3 \oplus x_1 x_2 \oplus x_1 x_2, x_4, x_5 \oplus x_4 x_3 \oplus x_1 x_2 x_4\rangle \\
          &= |x_1, x_2, x_3, x_4, x_5 \oplus x_4 x_3 \oplus x_1 x_2 x_4\rangle \\
|e\rangle &= |x_1, x_2, x_3, x_4, x_5 \oplus x_4 x_3 \oplus x_4 x_3 \oplus x_1 x_2 x_4\rangle \\
          &= |x_1, x_2, x_3, x_4, x_5 \oplus x_1 x_2 x_4\rangle
\end{aligned}$$

to see the overall effect of the gate sequence is a C$^3$NOT
thus proving the decomposition.
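The bit-level argument above can be replayed exhaustively over all 32 computational basis states. In this minimal sketch the qubits 1 to 5 are mapped to list indices 0 to 4 (index 2 being the ancilla), and the function names are hypothetical.

```python
from itertools import product

def toffoli(bits, c1, c2, target):
    """Flip bits[target] if both control bits are 1; returns a new tuple."""
    out = list(bits)
    out[target] ^= out[c1] & out[c2]
    return tuple(out)

def decomposed_c3not(bits):
    """Four Toffoli gates taking state |a> to state |e>."""
    b = toffoli(bits, 0, 1, 2)   # |b>: x3 -> x3 + x1 x2
    c = toffoli(b, 3, 2, 4)      # |c>: x5 -> x5 + x4 (x3 + x1 x2)
    d = toffoli(c, 0, 1, 2)      # |d>: ancilla x3 restored
    e = toffoli(d, 3, 2, 4)      # |e>: x5 -> x5 + x1 x2 x4
    return e

def direct_c3not(bits):
    """Target x5 flipped iff controls x1, x2, x4 are all 1; x3 untouched."""
    x1, x2, x3, x4, x5 = bits
    return (x1, x2, x3, x4, x5 ^ (x1 & x2 & x4))

def decomposition_holds():
    return all(decomposed_c3not(s) == direct_c3not(s)
               for s in product((0, 1), repeat=5))
```

Since the gates are permutations of basis states, agreement on all basis states establishes equality of the two circuits.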



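For completeness, note that the cases $k = 3, 2, 1$ have to
be treated separately, since they only allow for less and
less densely concatenated expansions (not shown). Their
respective durations are

$$\begin{aligned}
\tau(\mathrm{C}^{n-2}\mathrm{NOT})\big|_{k=1} ={}& 4\,\tau(\mathrm{C}^{m_1}\mathrm{NOT}) + 2\,\tau(\mathrm{SWAP}_{1,m_1+1}) \\
&+ 4\,\tau(\mathrm{C}^{m_2}\mathrm{NOT}) + 2\,\tau(\mathrm{SWAP}_{1,m_2}) \\
&+ 8\,\tau(\mathrm{C}^{m_3+1}\mathrm{NOT}) + (1 - \delta_{m_3,1})\,16\,\tau(\mathrm{SWAP}_{1,m_3})
\end{aligned} \qquad (31)$$

$$\begin{aligned}
\tau(\mathrm{C}^{n-2}\mathrm{NOT})\big|_{k=2} ={}& 4\,\tau(\mathrm{C}^{m_1}\mathrm{NOT}) \\
&+ 4\,\tau(\mathrm{C}^{m_2}\mathrm{NOT}) + \tau(\mathrm{SWAP}_{1,m_2}) \\
&+ 24\,\tau(\mathrm{C}^{m_3+1}\mathrm{NOT}) + (1 - \delta_{m_3,1})\,32\,\tau(\mathrm{SWAP}_{1,m_3})
\end{aligned} \qquad (32)$$

$$\begin{aligned}
\tau(\mathrm{C}^{n-2}\mathrm{NOT})\big|_{k=3} ={}& 4\,\tau(\mathrm{C}^{m_1}\mathrm{NOT}) \\
&+ 4\,\tau(\mathrm{C}^{m_2}\mathrm{NOT}) + \tau(\mathrm{SWAP}_{1,m_2}) \\
&+ 37\,\tau(\mathrm{C}^{m_3+1}\mathrm{NOT}) + (1 - \delta_{m_3,1})\,48\,\tau(\mathrm{SWAP}_{1,m_3})\,.
\end{aligned} \qquad (33)$$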
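The generic duration for $k \geq 4$ and the three special cases can be collected in one helper. This is a sketch with the coefficients as read off from Eqns. (30) to (33); the function and argument names are hypothetical, and the module times passed in are placeholders.

```python
def cnot_assembly_duration(k, m3, t_c1, t_c2, t_c3, t_s1, t_s2, t_s3):
    """Duration of the assembled multiply-controlled NOT.
    t_c1 = tau(C^{m1}NOT), t_c2 = tau(C^{m2}NOT), t_c3 = tau(C^{m3+1}NOT),
    t_s1 = tau(SWAP_{1,m1+1}), t_s2 = tau(SWAP_{1,m2}), t_s3 = tau(SWAP_{1,m3})."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    swap_on = 0 if m3 == 1 else 1          # the factor (1 - delta_{m3,1})
    if k == 1:
        return (4 * t_c1 + 2 * t_s1 + 4 * t_c2 + 2 * t_s2
                + 8 * t_c3 + swap_on * 16 * t_s3)
    head = 4 * t_c1 + 4 * t_c2 + t_s2
    if k == 2:
        return head + 24 * t_c3 + swap_on * 32 * t_s3
    if k == 3:
        return head + 37 * t_c3 + swap_on * 48 * t_s3
    return head + (13 * k - 8) * t_c3 + swap_on * (13 * k + 3) * t_s3
```

Setting all module times to 1 makes the weights visible: for `k = 4` and `m3 = 2` the result is `4 + 4 + 1 + 44 + 55 = 108`, whereas `m3 = 1` switches off the last term and gives `53`.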

However, the total number of gates only depends on
$k = 1, 2, 3, \dots$, so that one obtains as the overall quality

^Ifc ^ (-P'c"iWOt)'^(-Fswapi,„j + i)'' 
X (i^C'"2ArOT)'*(-Fs\VAPi,„2 )^ 

X (i'c-r.s + l^oy) (-f'sWAPi.„3)' 

-t(C"'^NOT)\^/Tr 

(34) 



X e 



Given the duration of the decomposition as in Eqn. [30],
it is easy to see that implementing the $m_1$ control qubits
Figure 10: Decomposition of a C$^{n-2}$NOT gate on a linear coupling topology: (a) reduction of the number of control qubits to four
intermediate gates with fewer control qubits and (b) decomposition of the intermediate multiply-controlled NOT-gate appearing
in (a). In an $n$-qubit system, there is one target qubit, one ancilla qubit and $n-2$ control qubits; so $m_1 + (m_2 - 1) + 2k m_3 = n - 2$
with $m_1, m_3 \geq 1$ and $m_2 \geq 2$. Read the brackets $[\,\cdot\,]_k$ in (a) as to be expanded $k$ times and $(\,\cdot\,)_k$ in (b) as expanded $k$-fold.
Figure 11: (Colour online) Comparing implementations of C$^{n-2}$NOTs on a linear Ising spin chain using CNOT and SWAP$_{1,2}$
modules for the RISC compilation or multi-qubit building blocks according to the CISC assembler scheme of Fig. [10]. As a
short-hand, the different numbers of control qubits are expressed by $m := (m_1, m_2, m_3)$. Using the expansion of Fig. [10], the
CISC results (black solid lines) are obtained for $k = 4, 5, 6, \dots$ with $m_1 = 8$ and $m_2 = 8$ (for odd $n$) or $m_2 = 7$ (for even $n$),
while $m_3 = 1$ thus ensuring $m_1 + (m_2 - 1) + 2k m_3 = n - 2$. The red dotted line extrapolates again the direct CISC results
beyond 10 qubits. In (a) deviations from straight lines occur, as the cases $k = 1, 2, 3$ follow special concatenation patterns (see
text), while $k = 4, 5, 6, \dots$ are generic. The inset in (a) also shows results of a non-scalable recursive expansion that is confined
up to 19 qubits (blue circles). The step functions with periods indicated by tags represent a faster alternative explained in the
next section, where the boxed part of trace (a) is blown up in Fig. [15].



comes with the lowest time weight (4) and without a
time overhead of auxiliary gates. Implementing the $m_2$
control qubits, however, requires the same time weight
(4), but entails the time for one auxiliary SWAP$_{1,m_2}$ gate.
In order to implement the $k \cdot m_3$ control qubits, in turn,
a sizeable amount of auxiliary SWAPs is needed.
Therefore, whenever high fidelities can be reached (so
that the quality is limited by relaxation not by fidelity), a
good strategy of combining the expansive decomposition
in Fig. [10] a with the recursive decomposition in part
(b) is the following: given $n - 2$ control qubits and with
the current limitation from direct CISC compilation being
$m_j < 9$, choose $m_1$ to be the largest, $m_2$ to be the second
largest and such that one obtains an even number for
$n - m_1 - m_2 - 1 = 2k m_3$.

In the next step, a decision has to be made in order
to minimise the contributions in the last two lines
of Eqn. [30] whenever there are several integer solutions
$k m_3 = k' m_3'$. So for integer $k \geq 4$, this amounts to the
ordinary minimisation task

$$\begin{aligned}
\min_k\;\; & (13k - 8)\,\{(m_3 + 2)\Delta_C + a\} + (1 - \delta_{m_3,1})\,(13k + 3)\,\{m_3 \Delta_S + b\} \\
\text{subject to:}\;\; & k\,m_3 = \mathrm{const.} = \frac{n - (m_1 + m_2 + 1)}{2}
\end{aligned} \qquad (35)$$



Here we approximate the times for a C$^{m_3+1}$NOT by the
linear expression $\tau(\mathrm{C}^{m_3+1}\mathrm{NOT}) = (m_3 + 2)\Delta_C + a$ and
likewise for the SWAP$_{1,m_3}$ by $\tau(\mathrm{SWAP}_{1,m_3}) = m_3 \Delta_S + b$
with the values for the slopes $\Delta_C, \Delta_S$ and the offsets $a$
and $b$ being taken from the respective linear regression for
extrapolating $\Delta_\infty$ for direct CISC compilation ($\Delta_C^{\infty} = 2.15$
and $\Delta_S^{\infty} = 0.69$ as well as $a = -4.48$ and $b = 0.06$).
In the setting of these parameters, Eqn. [35] implies it is
timewise advantageous to choose as the decomposition of
the interior block in Fig. [10]b the counter-intuitive option
with a large number $k$ of small block sizes $m_3$. This
is because in the above parameter setting, the duration
takes its minimum on the margin circumventing the time
overhead skipped by $(1 - \delta_{m_3,1}) = 0$, thus giving high
repetitions $k = (n - m_1 - m_2 - 1)/2$ and smallest block sizes
$m_3 = 1$ corresponding to Toffoli gates. The speed-up is
illustrated in Fig. [11]: although it amounts to a factor of
2.45 compared to the standard RISC decomposition, the
potential as extrapolated from direct CISC compilation
up to nine qubits gives a lower bound for the speed-up
of 13.6.
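The minimisation over the integer splits $k \cdot m_3 = \mathrm{const.}$ is small enough to enumerate. The sketch below uses the regression values quoted in the text, assigning slope 2.15 with offset $-4.48$ to the multiply-controlled NOT and slope 0.69 with offset 0.06 to the SWAP (this assignment is our reading of the garbled parenthesis); function names are hypothetical.

```python
DELTA_C, OFFSET_A = 2.15, -4.48   # linear fit for tau(C^{m3+1}NOT)
DELTA_S, OFFSET_B = 0.69, 0.06    # linear fit for tau(SWAP_{1,m3})

def interior_cost(k, m3):
    """Objective of the minimisation: last two lines of the k >= 4 duration."""
    t_cnot = (m3 + 2) * DELTA_C + OFFSET_A
    t_swap = m3 * DELTA_S + OFFSET_B
    swap_on = 0 if m3 == 1 else 1          # the factor (1 - delta_{m3,1})
    return (13 * k - 8) * t_cnot + swap_on * (13 * k + 3) * t_swap

def best_split(total):
    """Cheapest (cost, k, m3) with k * m3 == total and integer k >= 4."""
    candidates = [(interior_cost(total // m3, m3), total // m3, m3)
                  for m3 in range(1, total + 1)
                  if total % m3 == 0 and total // m3 >= 4]
    return min(candidates)
```

For `total = 12` the three admissible splits cost about 291.6 for `(k, m3) = (12, 1)`, 405.0 for `(6, 2)` and 393.0 for `(4, 3)`, so the minimum sits at the smallest block size, as stated above. Without the SWAP overhead the split `(6, 2)` would be slightly cheaper, which illustrates that it is the skipped overhead that decides the matter.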



Faster Alternatives of C$^{n-2}$NOT

Since the potential of CISC compiling C$^n$NOT gates is
largely not yet exploited by the previous scheme, it is
worth showing a faster scalable decomposition at the
expense of being more elaborate. To this end, we proceed
in two steps. First, we show the general principle of an
auxiliary backbone gate, namely an indirect CNOT between
qubit 1 and some distant qubit $\ell + 1$ (which may
be separated by $\ell$ intermediate qubits, e.g., in a linear
coupling topology). Second, we implement the resulting
faster alternative into Fig. [10] b.



Figure 12: Principle of generating an indirect CNOT by
restricting $\ell = 2^p$, as described in the text.



Fig. [12] shows the principle identities: the two identities
hold whenever $\ell$ is an integer power of two, $\ell = 2^p$. The
second identity is easy to see since the ascending series of
CNOT gates can be represented by a Jordan matrix over
the field of binary numbers $\mathbb{Z}_2^{\ell+1} := \{0, 1\}^{\ell+1}$ with the
addition modulo 2 so as to take the form

$$\begin{pmatrix}
1 & 1 & 0 & \cdots & 0 \\
0 & 1 & 1 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & 1 \\
0 & \cdots & 0 & 0 & 1
\end{pmatrix} \qquad (36)$$



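In terms of natural numbers, its $\ell$-th power reads

$$\begin{pmatrix}
1 & \binom{\ell}{1} & \binom{\ell}{2} & \cdots & \binom{\ell}{\ell} \\
0 & 1 & \binom{\ell}{1} & \cdots & \binom{\ell}{\ell-1} \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & \binom{\ell}{1} \\
0 & \cdots & 0 & 0 & 1
\end{pmatrix} \qquad (37)$$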

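For $\ell = 2^p$ with $p = 1, 2, 3, 4, \dots$ it gives the desired indirect
CNOT between qubits 1 and $\ell + 1$ as seen in the representation over $\mathbb{Z}_2^{\ell+1}$

$$\begin{pmatrix}
1 & 0 & \cdots & 0 & 1 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & \cdots & 0 & 1 & 0 \\
0 & \cdots & 0 & 0 & 1
\end{pmatrix} \qquad (38)$$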

This is because, due to a theorem by Lucas [63], only for $\ell$
being an integer power of two, all the binomial coefficients
$\binom{\ell}{j}$ with $j = 1, 2, \dots, (\ell - 1)$ are even, while $\binom{\ell}{0} = \binom{\ell}{\ell} = 1$.
They are therefore the only ones not to vanish in the
representation over $\mathbb{Z}_2^{\ell+1}$.
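The parity argument can be checked by brute force for small chains. The sketch below builds the $(\ell+1) \times (\ell+1)$ Jordan matrix with ones on the diagonal and first superdiagonal, raises it to the $\ell$-th power with arithmetic modulo 2, and tests whether only the diagonal and the corner entry survive. Function names are hypothetical.

```python
import numpy as np

def ladder_power_mod2(ell):
    """ell-th power of the (ell+1)-dimensional Jordan matrix over Z_2."""
    dim = ell + 1
    jordan = np.eye(dim, dtype=np.int64) + np.eye(dim, k=1, dtype=np.int64)
    power = np.eye(dim, dtype=np.int64)
    for _ in range(ell):
        power = (power @ jordan) % 2   # reduce at every step to stay in Z_2
    return power

def is_indirect_cnot(matrix):
    """True if the matrix is the identity plus a single 1 in the top-right corner."""
    dim = matrix.shape[0]
    target = np.eye(dim, dtype=np.int64)
    target[0, dim - 1] = 1
    return np.array_equal(matrix, target)
```

For `ell` in `(2, 4, 8, 16)` the check succeeds, while, e.g., `ell = 3, 5, 6, 12` leave further odd binomial coefficients behind.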

The principle backbone summarised in the identities of
Fig. [12] may then be extended for more general purposes:
(1) Without changing $\ell$ one may insert further control
qubits in the sense of replacing any CNOT by a Toffoli
or a higher C$^m$NOT. (2) Likewise, one may formally insert
further target qubits to be flipped so that the NOT
component is performed on more than one qubit. These
two extensions enable a faster alternative for decomposing
the module of Fig. [10] a than given in Fig. [10] b. This
alternative is shown in Fig. [13] a. Due to the backbone
scheme of Fig. [12] the only constraint is that 1 plus the
number of neutral blocks of size $m_{3,j}$ (represented by
dashed boxes in Fig. [13] a) equals $\ell = 2^p$ with $p \in \mathbb{N}$.
Changing the assembly of a C$^{n-2}$NOT from the scheme of
Fig. [10] to the alternative of Fig. [13] follows by identifying
$k = \ell - 1 = 2^p - 1$, i.e.

$$\begin{aligned}
n - 2 &= m_1 + (m_2 - 1) + 2k\,m_3 \\
&= m_1 + (m_2 - 1) + \sum_{j=1}^{2(2^p - 1)} m_{3,j}
\end{aligned} \qquad (39)$$



where we explicitly allow for individual block sizes $m_{3,j}$.
As shown in Fig. [13] b, the decomposition of $m_{3,j+1}$
control qubits (solid boxes) and $m_{3,j}$ spacer qubits
(dashed boxes) leads to an auxiliary gate, which we term
C$^{(1+m_{3,j+1})}$NOT$^{(m_{3,j})}$. It can be realised as in Fig. [14].
Note that the construction scheme of Fig. [10] a requires
to each solid box an equally sized dashed box.

In order to express the overall duration, we need the
following notation: let the array $\mathbf{m}_3 := (\mathbf{m}_3^{1}, \mathbf{m}_3^{2}, \mathbf{m}_3^{3}, \mathbf{m}_3^{4})$
of total length $2\ell - 2$ comprise the box sizes $m_{3,j}$ of
Fig. [13] b grouped into the four subsets

■m\ : sizes of the | solid boxes on the left, 

m| : sizes of the solid boxes on the right, 

: sizes of the | dashed boxes on the left, and 

m| : sizes of the dashed boxes on the right, 

and let $m_3^{s}$ be the largest entry in $\mathbf{m}_3^{s}$, $s = 1, 2, 3, 4$. Then
the duration of the decomposition of a C$^{n-2}$NOT-gate of
Fig. [10] a according to Fig. [13] and [14] reads



t(c"-2not) = 2 • 2f (t(c"^+^not) -f r(c'"'+iNOT) 

-I- 2 max(m3, m|) • t(cnot) 
2 • 2f (t(c"^+1not) -f r(c"^+iNOT) 
-I- 2 max(TO3, TO3) • t(cnot) 

+ TswAPi „ ■ 



1-2 



l-l 



J = l 



j=2e-3 



' =21-2 



"'3,1 



a times 



Figure 13: (a) Alternative decomposition of the constituent
module of Fig. [10] a. The qubits tagged by the numbers
$1, 2, \dots, \ell$ relate to the neutral qubits of Fig. [12]. (b) Each
auxiliary building block involves a solid box containing $m_{3,j+1}$
control qubits and a dashed box containing $m_{3,j}$ spacer
qubits. It is termed C$^{(1+m_{3,j+1})}$NOT$^{(m_{3,j})}$ and its realisation
is shown in Fig. [14].



Figure 14: Realisation of the auxiliary building block involv- 
ing several NOT actions. 



Obviously Eqn. [40] is symmetric under the exchanges 



and 



'3' 



while the max-functions break 



(40) 



a full symmetry that would also require invariance under 
m|. Consequently, the broken 
symmetry imposes rules how to increase the box sizes in 
a time saving way. However, since the duration is limited
by the largest box size in each time slot (left part and
right part in Fig. [13] b), one can fill the left and the right
slots sequentially.

Given $n-2$ control qubits, a time saving decomposition
of a C$^{n-2}$NOT results by following the subsequent rules:
calculate the auxiliary variables $p' := \lfloor \log_2(n - 1) \rfloor - 1$
and $r := (n - 1) - 2^{p'+1}$ to determine $p$ as

1. $p = p' - 1$ for $p' \geq 2$ and $r \in [1, 2^{p'-2}]$ entailing
$m_{3,j} \in \{2, 3\}$ and the jump at half width within
step I (see Fig. [15]);

2. $p = p' - 2$ for $p' \geq 3$ and $r \in [1 + 2^{p'-1}, 5 \cdot 2^{p'-3}]$ entailing
$m_{3,j} \in \{5, 6\}$ and the minor jump at quarter
width within step II (Fig. [15]);

3. $p = p'$ otherwise; then $m_{3,j} \in \{1, 2\}$.

4. NB: for $p = p' - 3$ block sizes would increase to
$m_{3,j} \in \{8, 9\}$ leading to C^not and C^"not building
blocks, which are currently out of reach.





Figure 15: (Colour online) Detailed blow-up of the box in
Fig. [11]. The step function is a periodic repetition of steps
I-IV, where the length of the steps (in units of numbers of
qubits) doubles with every period as indicated by the tags in
Fig. [11]. The small jumps within steps I and II are explained
in the text.



Then a time-saving decomposition obeys the final rule

5. Once $p$ and the box sizes $m_{3,j} \in \{b, b+1\}$ are fixed,
arrange the vector of grouped sizes as
$(b+1, b+1, \dots, b+1, b, b, \dots, b, b)$
with the entries in descending order.
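Rules 1 to 3 for fixing $p$ can be sketched as follows. This is a minimal illustration, not the authors' routine: the function names are hypothetical, the bounds on $p'$ are read as inclusive, and the floor of the logarithm is computed via `int.bit_length` to avoid floating-point rounding at exact powers of two. The helper `total_controls` evaluates the right-hand side of the qubit bookkeeping $m_1 + (m_2 - 1) + \sum_j m_{3,j}$.

```python
def choose_p(n):
    """Return (p_prime, r, p, rule) for a C^{n-2}NOT on n qubits."""
    p_prime = (n - 1).bit_length() - 2          # floor(log2(n - 1)) - 1
    r = (n - 1) - 2 ** (p_prime + 1)
    if p_prime >= 2 and 1 <= r <= 2 ** (p_prime - 2):
        return p_prime, r, p_prime - 1, 1       # rule 1: box sizes in {2, 3}
    if p_prime >= 3 and 1 + 2 ** (p_prime - 1) <= r <= 5 * 2 ** (p_prime - 3):
        return p_prime, r, p_prime - 2, 2       # rule 2: box sizes in {5, 6}
    return p_prime, r, p_prime, 3               # rule 3: box sizes in {1, 2}

def total_controls(m1, m2, box_sizes):
    """Number of control qubits covered by m1, m2 and the individual boxes."""
    return m1 + (m2 - 1) + sum(box_sizes)
```

For the three cases discussed in the text, `choose_p(41)` selects rule 3 with `p = 4`, `choose_p(42)` selects rule 2 with `p = 2`, and `choose_p(137)` selects rule 1 with `p = 5`. For `n = 42` the box sizes `(6, 5, 5, 5, 5, 5)` together with `m1 = m2 = 5` indeed cover `n - 2 = 40` control qubits.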



Example 3 Finally, in a system of $n = 137$ qubits a
C$^{135}$NOT gives $p' = 6$ and $r = 8 \in [1, 2^{p'-2}] = [1, 16]$, so
rule 1 applies and $p = 5$. Eqn. [39] and rule 5 then give
$\mathbf{m}_3^{1} = ((3)_8, (2)_8)$ and $\mathbf{m}_3^{2} = (2)_{16}$, $\mathbf{m}_3^{3} = \mathbf{m}_3^{4} = (2)_{15}$,
$m_1 = m_2 = 2$. Therefore assembling a C$^{135}$NOT refers to
C^NOT and C^NOT gates.



Clearly, the duration will not increase as long as all en- 
tries (64-1) fall into 7773, where one may choose to start on 
top of Fig. [131 A time-step will occur as soon as the first 
(6 -I- 1) falls into 7n,§, which neither affects max(7773, 7773) 
nor max(77i|, TTif). Analogous features hold for filling 7773 
and 7773. They bring about the periodic step function 
shown in Fig. [11]. Its details are given in Fig. [15], where
the jump within step I is due to rule 1, while the minor
jump within step II has its roots in rule 2 above.

For illustration, consider the following three cases: 

Example 1 In a system of $n = 41$ qubits a C$^{39}$NOT with
one auxiliary qubit gives $p' = 4$. By $r = 8$ rule 3 applies
and $p = 4$. Eqn. [39] and rule 5 then yield $\mathbf{m}_3^{1} = (2)_8$
and $\mathbf{m}_3^{2} = (1)_8$, while $\mathbf{m}_3^{3} = \mathbf{m}_3^{4} = (1)_7$ and $m_1 = m_2 = 1$.
Hence the time-saving decomposition involves just CNOT
and Toffoli gates (beyond the auxiliary CNOT and indirect
SWAP gates).

Example 2 Yet for $n = 42$ qubits a C$^{40}$NOT gives $p' = 4$
and $r = 9 \in [1 + 2^{p'-1}, 5 \cdot 2^{p'-3}] = [9, 10]$, so rule 2
applies and yields $p = 2$. By Eqn. [39] and rule 5 one
finds $\mathbf{m}_3^{1} = (6, 5)$ and $\mathbf{m}_3^{2} = (5, 5)$ as well as $\mathbf{m}_3^{3} = (5)$
and $\mathbf{m}_3^{4} = (5)$ with $m_1 = 5$ and $m_2 = 5$. So the C$^{40}$NOT
decomposes favourably via C^not and C^NOT gates.

Finally, the times of Eqn. [40] as well as the decomposition
schemes of Fig. [10]a, Fig. [13] and Fig. [14] translate
into the respective quality factors as

q(c" ^NOT) - (i^ci + ^iNOT • -P]::i + '"2NOT ' -^c^ + ^a.! not)
f-1

2P+1

1 NOT/

2"+" -(ma

X (-F;,1 + ™3,2.-2^ot)

X n { (^cl + '"3,2f + 2-2jf,
J = 2

X (F X (F ^2 X p-^(c""'not)/Th

2''+'-(m3,2f + l-2j-l)



(41) 



The corresponding numerical quality results have already
been shown in Fig. [11] b as step function. They represent
compilations with building blocks using multiply
controlled subblocks with up to 6 and 7 control qubits
thus giving another significant improvement over the
assembly scheme described in the previous subsection.
The results are also summarised in Tab. II.






IV. IMPLICATIONS FOR 
MULTIPLY-CONTROLLED GENERAL UNITARY 
OPERATIONS 

The fast assembly schemes of multiply controlled NOT
gates given in the previous subsection also allow for faster
realisations of multiply controlled general unitary gates
than in the classic work of Barenco, Bennett et al. [20].



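Lemma 1 Recall: every self-inverse 1-qubit special unitary
$U_+ = U_+^{-1} \in SU(2)$ is trivially $\pm\mathbb{1}$, while every
self-inverse $U_- \in U(2) \setminus SU(2)$ is unitarily similar to $\sigma_x$.

Proof: To see the second assertion constructively, observe
that any self-adjoint $U_- \in U(2) \setminus SU(2)$ shows
$\det U_- = -1$ and may thus be represented as a pure
quaternion $U_- = x \cdot \sigma_x + y \cdot \sigma_y + z \cdot \sigma_z$ with $x^2 + y^2 + z^2 = 1$.
Ensuring $|a|^2 + |b|^2 = 1$ in $V \in SU(2)$, it can be identified
with

$$U_- = V \sigma_x V^\dagger = \begin{pmatrix} a & b \\ -b^* & a^* \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} a^* & -b \\ b^* & a \end{pmatrix} \qquad (42)$$

via $x = \mathrm{Re}(a^2 - b^2)$, $y = \mathrm{Im}(b^2 - a^2)$, $z = 2\,\mathrm{Re}(a b^*)$. ■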
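The identification in the proof can be checked numerically. The sketch below assumes the parametrisation $V = \begin{pmatrix} a & b \\ -b^* & a^* \end{pmatrix}$ with $|a|^2 + |b|^2 = 1$, which is one choice consistent with the stated expressions for $x$, $y$ and $z$; the function names are hypothetical.

```python
import numpy as np

SIGMA_X = np.array([[0, 1], [1, 0]], dtype=complex)
SIGMA_Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
SIGMA_Z = np.array([[1, 0], [0, -1]], dtype=complex)

def special_unitary(a, b):
    """Assumed parametrisation of V in SU(2); requires |a|^2 + |b|^2 = 1."""
    return np.array([[a, b], [-np.conj(b), np.conj(a)]], dtype=complex)

def conjugated_sigma_x(a, b):
    """U_- = V sigma_x V^dagger."""
    v = special_unitary(a, b)
    return v @ SIGMA_X @ v.conj().T

def pure_quaternion(a, b):
    """x sigma_x + y sigma_y + z sigma_z with the coefficients of the lemma."""
    x = (a ** 2 - b ** 2).real
    y = (b ** 2 - a ** 2).imag
    z = 2 * (a * np.conj(b)).real
    return x * SIGMA_X + y * SIGMA_Y + z * SIGMA_Z
```

For instance, `a = cos(0.3) * exp(0.7j)` and `b = sin(0.3) * exp(-1.1j)` give matching matrices, and the result squares to the identity with determinant $-1$, as required for a self-inverse element of $U(2) \setminus SU(2)$.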



Corollary 2 Given a realisation of a C$^{n-2}$NOT in time
$\tau(\mathrm{C}^{n-2}\mathrm{NOT})$ on an $n$-qubit system with one auxiliary and
one target qubit. Then the realisation of a multiply controlled
general unitary C$^{n-2}U$ takes the time

1. $\tau(\mathrm{C}^{n-2}U) \leq \tau(\mathrm{C}^{n-2}\mathrm{NOT}) + \tau(V) + \tau(V^\dagger)$,
if $U \in U(2) \setminus SU(2)$ is self-inverse and
$V \in SU(2)$ as in Eqn. [42];

2. $\tau(\mathrm{C}^{n-2}U) \leq 2 \cdot \tau(\mathrm{C}^{n-2}\mathrm{NOT}) + \tau(\text{3 local gates})$,
if $U \in SU(2)$ and $U \neq \mathbb{1}$;

3. Assertion (1) can be generalised to multiply-controlled
$(q + 1)$-qubit self-inverse unitaries of the
form $U_- = V (\sigma_x \otimes \mathbb{1}_{2^q}) V^\dagger$ with $V \in SU(2^{q+1})$.

Proof: 

(1) The inequality is a direct consequence of applying
Eqn. [42] to the NOT operation on the target qubit. This
is qubit 1 in Fig. [10] a, which for later convenience becomes
qubit in Figs. [T^and[T51 (Moreover, by reversing
Eqn. [42] to $\sigma_x = V^\dagger U_- V$ and using it on qubit in
Fig. [T^l one can absorb the time for at least one of the
local gates $V$ by virtue of the decomposition of Fig.lT5l b.)

(2) Direct consequence of Lemma 5.1 in Ref. [20].

(3) Obvious generalisation of Eqn. [42] with $q$ further
qubits added, e.g., on top of qubit 1 in Fig. [10] a. ■

Similar generalisations hold for further special cases ad- 
dressed in Ref. [13 Sec. 5.2. 



V. CONCLUSIONS AND OUTLOOK 

We have exploited the power of a cutting-edge
high-performance parallel cluster for quantum CISC compilation.
Thus the standard toolbox of universal quantum
gate modules (RISC) has been extended by time-optimised
effective multi-qubit gates (CISC). We have
shown how these CISC modules can be assembled in
a scalable way in order to address large-scale quantum
computing on systems that are too large for direct CISC
compilation. Although our optimal-control based
CISC-compilation routine exploits parallel matrix operations
for clusters as well as fast matrix exponentials [sst,
increasing the system size by one qubit roughly requires a
factor of eight more CPU time. Since direct CISC compilation
thus scales exponentially, scalable assembler schemes
for multi-qubit CISC modules are needed, and we have
presented some elementary yet important ones:

For indirect SWAPs, the quantum Fourier transform,
and multiply-controlled NOT gates, the CISC decomposition
is significantly faster than the standard RISC decomposition
into local plus universal two-qubit gates. The
current improvements range from 20% up to a speed-up
by nearly 300%. However, as illustrated in Tab. II, the
potential of CISC compilation is by far not yet exhausted
with the current schemes. As a noteworthy side result,
we have shown that gate errors in multi-qubit CISC
modules propagate more favourably than in RISC modules
confined to two-qubit gates.

Assembling pre-compiled effective multi-qubit modules
has further advantages beyond saving gate time: a problem
common to many implementations occurs as soon as
smaller decoherence-protected modules shall be embedded
in larger effective systems. Usually dissipative coupling
to a new environment also introduces new sources
of decay the original module has not been optimised for.
Therefore, practical handling is greatly facilitated if the
$m$-qubit modules with tailored optimisation under dissipation
and decoherence (see, e.g., [i^) extend to larger
units of relaxatively interacting qubits than the standard
of $m = 2$. This advantage can readily be envisaged by
a quantum-information processor, e.g., based on trapped
ions, where the 'passive qubits' are stored with spatial
separation from the currently 'active ones' in the processing
unit (see, e.g., [67, 68, 69] for overview and recent
developments).

Moreover, controlling physical $m$-qubit modules will
also allow for encoding logical qubits with specifically tailored
optimisation under more realistic relaxation models
[m than in ideal 'decoherence-free subspaces'.

This paves the way to another frontier of research: op- 
timising the quantum assembler task on the extended 
toolbox of quantum CISC-modules with effective many- 
qubit interactions. Finally, it is to be anticipated that 
methods developed in classical computer science can also 
be put to good use for systematically optimising quantum 
assemblers. 
Table II: Current Exploitation vs. Potential of Quantum CISC Compilation (Limit of Fast Local Controls)

gate             CISC potential:                              current status:                                                                 improvement
                 estimate   $\pi_{\rm cisc} = \Delta_2/\Delta_\infty$   exploitation        $\eta_m = \Delta_\infty/\Delta_m$                              $\xi_m = \Delta_2/\Delta_m$

SWAP$_{1,n}$     medium     2.16                              fairly exhausted    0.86 [m = 8]                                                1.88
n-QFT            medium     2.27                              halfway exhausted   0.53 [m = (5 QFT, 10 cP-SWAP)]                              1.20
C$^{n-2}$NOT     big        13.6                              not yet exhausted   0.18 [m = (8, 8, 1) for n odd or (8, 7, 1) for n even]      2.45
                                                                                  0.25-0.31 [m = (≤ 7, ≤ 6, ≤ 6)]                             3.45-4.19
APPENDIX SECTION



A. Variant-II of Scalably Assembling a Quantum Fourier Transform 




Figure 16: For $k \geq 2$ and even $m$, a $(km)$-qubit QFT can be assembled from $k$ times an $m$-qubit QFT and $4\binom{k}{2}$ instances of
$m$-qubit modules cP-SWAP$^{j}_{m}$, where the index $j$ of different phase-rotation angles takes the values $j = 1, 2, \dots, 2k - 1$. The
dashed boxes correspond to Fig. [7] and show the induction $k \to k + 1$.



B. Controlled-Phase-SWAP Modules for (k·10)-Qubit QFTs




Figure 17: Equivalent circuit representations of the three 10-qubit cP-SWAP modules needed for a $(k \cdot 10)$-qubit QFT, when rotation
angles less than $\pi/2^{10}$ are omitted (as described in the text): (a) cP-SWAP$^{1}_{10}$ with no truncation so $F_{tr} = 1$, (b) cP-SWAP$^{2}_{10}$ with
$F_{tr} = 0.9999902$, and (c) cP-SWAP$^{3}_{10}$, which covers all $j \geq 3$ with fidelity $F_{tr} > 0.9999991$. These building blocks are compiled
directly as effective 10-qubit modules thus reducing the time to less than 60% of the duration required for the decomposition
into 2-qubit modules.



